Optimal Convergence Rate in Feed Forward Neural 
Networks using HJB Equation 

Vipul Arora, Laxmidhar Behera and Ajay Pratap Yadav 


Abstract: A control theoretic approach is presented in this 
paper for both batch and instantaneous updates of weights in 
feed-forward neural networks. The popular Hamilton-Jacobi-Bellman 
(HJB) equation has been used to generate an optimal 
weight update law. The main contribution of this paper 
is that closed form solutions for both the optimal cost and the weight 
update can be achieved for any feed-forward network using the HJB 
equation in a simple yet elegant manner. The proposed approach 
has been compared with some of the existing best performing 
learning algorithms. It is found, as expected, that the proposed 
approach is faster in convergence in terms of computational 
time. Some benchmark test data, such as 8-bit parity, 
breast cancer and credit approval, as well as a 2D Gabor function, 
have been used to validate our claims. The paper also discusses 
issues related to global optimization. The limitations of popular 
deterministic weight update laws are critiqued and the possibility 
of global optimization using the HJB formulation is discussed. It is 
hoped that the proposed algorithm will generate interest 
among researchers working on fast learning algorithms 
and global optimization. 



Fig. 1. Schematic of a two-layer FFNN. 


I. Introduction 

SINCE the advent of the popular back propagation (BP) 
algorithm [1], issues concerning global convergence as 
well as fast convergence have caught the attention of 
researchers. One of the very first works, by Hagan and Menhaj 
[2], makes use of the Levenberg-Marquardt (LM) algorithm 
for training weights of multi-layered networks in batch mode. 
The gradient descent scheme that is at the heart of BP uses only 
the first derivative. Newton's method enhances the performance by 
making use of the second derivative, which is computationally 
intensive. The Gauss-Newton method approximates the second 
derivative with the help of first derivatives and hence provides 
a simpler way of optimization. The Levenberg-Marquardt 
(LM) algorithm [2], [3] modifies the Gauss-Newton method 
to improve the convergence rate by interpolating between 
gradient descent and the Gauss-Newton method. Faster 
convergence has been reported using the extended Kalman 
filtering (EKF) approach [4] and the recursive least squares approach 
[5]. Cong and Liang [6] derive an adaptive learning rate for 
a particular structure of neural networks using the resilient BP 
algorithm, with the aim of stability in an uncertain environment. 
Man et al. [7] use BP with Lyapunov stability theory, while 
Mohseni and Tan [8] use BP with a variable structure system 
for robust and fast convergence. Using a Lyapunov function 
and stability principle, Behera et al. [9] propose a scheme for 
an adaptive learning rate associated with the BP algorithm. The above 
methods ensure fast convergence but cannot guarantee global 
convergence. 

The authors are with the Department of Electrical Engineering, Indian 
Institute of Technology, Kanpur - 208016 (India). E-mail: vipular.iitk@gmail.com, 
lbehera@iitk.ac.in, ajaypratapyadav@gmail.com 


A. Global Optimization in FFNN 

Let us consider the modulo-2 function $f : \mathbb{R}^2 \to \mathbb{R}$. Here, 9 
patterns are formed using $x_1 \in \{0,1,2\}$ and $x_2 \in \{0,1,2\}$. 
The desired output is given by 

$$y = f(x_1, x_2) = \begin{cases} 1 & \text{if } (x_1 + x_2) \text{ is odd} \\ 0 & \text{if } (x_1 + x_2) \text{ is even} \end{cases} \qquad (1)$$

The patterns formed by this function are shown in Table I. 


TABLE I 

Patterns formed by Modulo-2 function 
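
The nine patterns follow directly from Eq. (1). The short Python sketch below enumerates them; the function and variable names are illustrative choices and do not come from the paper.

```python
from itertools import product

def modulo2(x1: int, x2: int) -> int:
    """Eq. (1): 1 if x1 + x2 is odd, 0 if it is even."""
    return (x1 + x2) % 2

# All 9 patterns formed from x1, x2 in {0, 1, 2}.
patterns = [(x1, x2, modulo2(x1, x2)) for x1, x2 in product((0, 1, 2), repeat=2)]

for x1, x2, y in patterns:
    print(x1, x2, y)
```

Four of the nine patterns have an odd sum and therefore output 1; the remaining five output 0.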
We consider a two-layer FFNN with structure $N_{i_0}$-$N_{i_1}$-$N_{i_2}$, 
where $N_{i_0}$, $N_{i_1}$ and $N_{i_2}$ stand for the number of neurons in the 
input, hidden and output layers, respectively. Fig. 1 shows a 
block schematic of the same. The non-linearity is introduced 
by using a sigmoidal activation function at each neuron. 
Fig. 2. Modulo-2 function: the two axes carry the two inputs; red circles 
and black stars represent output as 0 and 1, respectively. Blue lines show the 
separators of layer 1 of a 3-4-1 FFNN. (a) Global minimum reached; (b) 
Weights stuck in a local minimum. 


Each neuron acts as a linear separator which divides its input 
space into two regions. Its output changes smoothly, from 0 
in one region to 1 in the other, along the sigmoidal function. 
For the modulo-2 function, Fig. 2(a) shows one possible 
configuration of separators in layer 1. The output ($\in \mathbb{R}^4$) of 
layer 1 for this configuration is linearly separable. This shows 
that the modulo-2 function can be modeled with the help of a 
3-4-1 FFNN, where a bias (+1) is the third input. Also, since 
the pattern space is symmetric about the line $x_1 = 1$, we can 
get another configuration of separators reflected about $x_1 = 1$. 
This shows that multiple global minima may be possible for a 
problem. However, local minima also exist for this example, 
such as the one shown in Fig. 2(b). 

The success score is defined as the number of trials which 
successfully converge to zero output error over the total 
number of trials. 

$$\text{Success score} = \frac{\text{No. of successfully converging trials}}{\text{Total no. of trials}} \qquad (6)$$

If the success score is 100%, it implies that the network 
always converges to the global minimum. 

For training an FFNN, several trials are run, each initialized 
from a different set of weights. For implementation, an offset 
of 0.1 is added to all the inputs, and in the output, 0 and 1 
are replaced by 0.1 and 0.9, respectively. For a 3-4-1 FFNN 
trained using the BP algorithm in batch mode, the success score 
is observed to be around 25%. This means that about 75% of the time the 
network is stuck in a local minimum. On increasing the 
number of hidden neurons to 6, the success score rises to 
around 80%, and further to 90% with 10 hidden neurons. The 
success score reaches close to 100% with 15 hidden neurons. 
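
The data preparation and the success score of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the offset is applied to the two inputs but not to the +1 bias, and a trial counts as converged when its final error falls below a tolerance `tol`; neither detail is specified in the text, and all names are hypothetical.

```python
import numpy as np

def prepare_modulo2_data(offset=0.1, low=0.1, high=0.9):
    """Build the 9 modulo-2 patterns for a 3-input network (x1, x2, bias)."""
    grid = np.array([(x1, x2) for x1 in range(3) for x2 in range(3)], dtype=float)
    labels = grid.sum(axis=1) % 2 == 1           # True where x1 + x2 is odd
    # Assumption: the 0.1 offset is added to x1 and x2 only; the bias stays +1.
    inputs = np.hstack([grid + offset, np.ones((9, 1))])
    targets = np.where(labels, high, low)        # 0 -> 0.1, 1 -> 0.9
    return inputs, targets

def success_score(final_errors, tol=1e-3):
    """Eq. (6) in percent; `tol` is an assumed convergence threshold."""
    final_errors = np.asarray(final_errors, dtype=float)
    return 100.0 * np.mean(final_errors <= tol)
```

For example, `success_score([0.0, 0.0, 1.0, 1.0])` counts two converged trials out of four.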

This simple experiment shows that the mapping of the 
modulo-2 function using a 3-4-1 FFNN generates both local and 
global minima in the error surface. However, on increasing the 
number of hidden neurons to 15 or beyond, the error surface 
ceases to have local minima, as the above experiment shows. 
If we instead consider the XOR function, the score is 
always 100% irrespective of the network architecture, as there 
is no local minimum in the error surface. This experiment 
amply suggests that the type of the function and the network 
architecture together determine whether the error surface will have 
local minima. This modulo-2 function has been used in this 
paper as a benchmark to study global convergence in an FFNN. 

B. Previous Approaches 

The approaches for optimization can broadly be classified 
into two categories: deterministic and probabilistic. 

The deterministic methods aim at finding the minimum of 
the objective function by evaluating the function iteratively 
in steps, where each step is determined using the derivatives 
of the function. A simple local search method is gradient 
descent, which steps in the direction of the negative gradient 
of the function. For FFNN, back propagation (BP) training 
is based on the gradient descent scheme. In order to find 
a global minimum, a popular method is to move from the 
current local minimum to a lower minimum by transforming 
the objective function. Ng et al. [10] discuss various auxiliary 
functions which are designed to transport the algorithm from 
the current minimum to basins of better minima, using gradient 
descent. For training of FFNNs, Shang and Wah [11] use a 
trajectory based method that depends on heuristic functions to 
search the minima. Toh [12] uses deterministic line search over 
a monotonic transformation of the error function to progress to 
lower minima. Gan and Li [13] use the LM algorithm for 
non-linear least squares optimization. 

The probabilistic methods, on the other hand, may either 
proceed in steps, with step size defined stochastically, or else 
choose initial points for deterministic local searches. Simulated 
annealing associates a time varying probability distribution 
function with the parameter search space, such that the 
parameter configurations with lower values of the objective function 
have higher probability. The parameters change from one 
configuration to another based on stochastic jumps until they 
reach the configuration with maximum probability. Another 
popular approach uses genetic algorithms, which use heuristic 
rules inspired by biological evolution models. 
For training FFNNs, Boese and Kahng [14] use simulated 
annealing, while Tsai et al. [15] and Rob et al. [16] use genetic 
algorithms. Sexton et al. [17] compare the genetic algorithm 
and simulated annealing based approaches. Ludermir et al. [18] 
combine simulated annealing and tabu search with gradient 
descent so as to optimize the FFNN weights and architecture. 
Delgado et al. [19] use hybrid algorithms for evolutionary 
training of recurrent NNs. 

In this paper, we investigate theoretically whether there 
exists a deterministic weight update 
law that can always ensure convergence to the global minimum, starting from 
any random initialization of weights. 

C. Contribution and organization of the paper 

Researchers have tried to solve global optimization in 
FFNN using random weight search, and such approaches 
have remained heuristic at best. The deterministic approach 
to global optimization lEl is highly dependent on search 
direction, whose computation is again a heuristic process. 
We are surprised to find that nobody appears to have used 
the Hamilton-Jacobi-Bellman (HJB) equation for solving 
this problem. The HJB equation comes from dynamic 
programming [20], which is a popular approach for optimal 
control of dynamical systems [21], [22]. 
In this paper, the weight update law has been converted into 
a control problem and dynamic optimization has been used to 
derive the update law. The derivation of the HJB based weight 
update law is surprisingly simple and straightforward. The 
closed form solutions for the optimal cost and the optimal weight 
update law have simple structures. The proposed approach 
has been compared with some of the existing best performing 
learning algorithms and is found to be faster in convergence in 
terms of computational time. In this paper we investigate 
whether the HJB based weight update law can address the following two 
issues: 

1) Can the local minima in the 3-4-1 network architecture for 
the modulo-2 function be avoided? 

2) Can the algorithm converge to the local minimum at 
optimum speed for any problem? 

The rest of the paper is organized as follows. Sec. II 
represents the supervised training of FFNN as a control 
problem and derives the optimal weight update law using the HJB 
equation. Sec. III derives the HJB based weight update law 
in instantaneous mode, with a special case of a single output 
neural network. Simulation results are presented in Sec. IV. 
A detailed discussion on convergence behavior is given 
in Sec. V. Concluding remarks are provided in Sec. VI. 


II. HJB Based Offline Learning of FFNN 

The output $y_p \in \mathbb{R}^{N_o}$ for a given input pattern $x_p$ 
can be written as 

$$y_p = f(w, x_p) \qquad (7)$$

Here, $w \in \mathbb{R}^{N_w}$ is the vector of weight parameters involved 
in the FFNN to be trained, with $N_w$ total weight parameters. 
The derivative of $y_p$ w.r.t. time $t$ is 

$$\dot{y}_p = \frac{\partial f(w, x_p)}{\partial w}\,\dot{w} = J_p \dot{w} \qquad (8)$$


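where $J_p = \partial f(w, x_p)/\partial w$ is the Jacobian matrix, whose elements 
are $J_{p,ij} = \partial y_{p,i}/\partial w_j$. The desired output is given by $y_p^d = 
f(w^d, x_p)$ and its derivative with respect to time is 

$$\dot{y}_p^d = J_p \dot{w}^d = 0. \qquad (9)$$

The estimation error is $e_p = y_p^d - y_p$ and its derivative w.r.t. 
time $t$ is 

$$\dot{e}_p = \dot{y}_p^d - \dot{y}_p = -J_p \dot{w}, \quad \text{as } \dot{y}_p^d = 0 \qquad (10)$$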

The optimization of the neural network weights is 
formulated as a control problem. 

$$\dot{e}_p = -J_p \dot{w} = -J_p u \qquad (11)$$

The control input updates the weights as $u = \dot{w}$. 

In the batch mode, all the patterns are learnt simultaneously. 
Hence, the dynamics can be written jointly as 

$$\dot{e} = -J u \qquad (12)$$

where $e = [e_1^T, e_2^T, \ldots, e_{N_p}^T]^T$ and $J = [J_1^T, J_2^T, \ldots, J_{N_p}^T]^T$. Hence, 
$e$ is an $N_oN_p \times 1$ vector, $J$ is an $N_oN_p \times N_w$ matrix and $u$ 
is an $N_w \times 1$ vector. 


A. Optimal Weight Update 

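The cost function is defined over the time interval $(t, T]$ as 

$$V(e(t)) = \int_t^T L(e(\tau), u(\tau))\, d\tau \qquad (13)$$

$$\text{where } L(e, u) = \frac{1}{2}\left(e^T e + u^T R u\right) \qquad (14)$$

with $R$ as a constant $N_w \times N_w$ matrix. Here, $t$ signifies the 
iterations to update $w$. Our goal is to find an optimal weight 
update law $u(t)$ which minimizes the aforementioned global 
cost function. Hence, we get the following equation, which 
is popularly known as the Hamilton-Jacobi-Bellman (HJB) 
equation. 

$$\min_u \left\{ \frac{\partial V^*}{\partial e}\,\dot{e}(t) + L(e(t), u(t)) \right\} = 0 \qquad (15)$$

Putting the expressions for $\dot{e}(t)$ and $L(e(t), u(t))$ from eqs. 
(12) and (14), respectively, 

$$\min_u \left\{ -\frac{\partial V^*}{\partial e} J u(t) + \frac{1}{2} e(t)^T e(t) + \frac{1}{2} u(t)^T R u(t) \right\} = 0 \qquad (16)$$

Here, $\partial V^*/\partial e$ is a $1 \times N_oN_p$ vector. Differentiating with respect to 
$u$, we get the optimal update law as 

$$u^*(t) = R^{-1} J^T \left(\frac{\partial V^*}{\partial e}\right)^T \qquad (17)$$

In order to find the expression for $\partial V^*/\partial e$, we put the optimal 
$u$ from eq. (17) in eq. (16). 

$$e(t)^T e(t) - \frac{\partial V^*}{\partial e}\, J R^{-1} J^T \left(\frac{\partial V^*}{\partial e}\right)^T = 0 \qquad (18)$$

A proper solution of eq. (18) requires $J R^{-1} J^T$ to be 
positive definite. 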

This is an under-determined system of equations. However, 
the optimal input must stabilize the system. The stability of the 
system can be analyzed with the help of a Lyapunov function 
defined as 

$$\mathcal{V}(e) = \frac{1}{2} e^T e \qquad (19)$$

The equilibrium point $e = 0$ is stable if $\dot{\mathcal{V}}(e)$ is negative 
definite. 

$$\dot{\mathcal{V}}(e(t)) = e^T \dot{e} \qquad (20)$$

$$= -e^T J u^*(t) \qquad (21)$$

$$= -e^T J R^{-1} J^T \left(\frac{\partial V^*}{\partial e}\right)^T \qquad (22)$$

If one selects the following form of $\partial V^*/\partial e$, 

$$\frac{\partial V^*}{\partial e} = e(t)^T C(t)^T \qquad (23)$$

where $C(t)$ is chosen to be a positive definite matrix, then 
$\dot{\mathcal{V}}(e(t))$ becomes negative definite. In order to find the 
expression for $C(t)$, we substitute eq. (23) in eq. (18). 

$$e(t)^T \left( I - C(t)^T J R^{-1} J^T C(t) \right) e(t) = 0 \qquad (24)$$

where $I$ is the $N_oN_p \times N_oN_p$ identity matrix. For this to be true for 
all $e(t)$, 

$$C(t)^T J R^{-1} J^T C(t) = I \qquad (25)$$
To find a solution for $C(t)$, we decompose $J R^{-1} J^T = 
U S U^T$ into eigenvectors $U$ and a diagonal matrix of 
eigenvalues $S$. Since $J R^{-1} J^T$ is a symmetric positive definite matrix, 
$U^T U = U U^T = I$ and all eigenvalues are positive. Now, $C(t)$ 
can assume the following form to satisfy eq. (25). 

$$C(t) = U S^{-1/2} U^T \qquad (26)$$

This form assures $C(t)$ to be positive definite, so that the 
system is stable around $e = 0$. However, in implementation, 
a small positive multiple of the identity is 
added to $S$ so as to avoid numerical instability. 

Finally, we get the optimal weight update law by combining 
equations (17) and (23) as 

$$\dot{w} = u^*(t) = R^{-1} J^T C(t) e(t) \qquad (27)$$

where $C(t)$ is given by eq. (26). 
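
Eqs. (26) and (27) translate almost line by line into NumPy. The sketch below assumes $R = rI$ (the choice used later in the experiments); the function name and the value of the stabilizing term `eps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hjb_update(J, e, r=1.0, eps=1e-8):
    """Sketch of u* = R^{-1} J^T C e with R = r I and C = U S^{-1/2} U^T.

    J   : (N_o*N_p, N_w) stacked Jacobian
    e   : (N_o*N_p,) stacked error vector
    eps : small positive term added to the eigenvalues (assumed value)
    """
    M = (J @ J.T) / r                   # J R^{-1} J^T, symmetric
    S, U = np.linalg.eigh(M)            # eigenvalues S, orthonormal eigenvectors U
    S = S + eps                         # guard against near-zero eigenvalues
    C = U @ np.diag(S ** -0.5) @ U.T    # eq. (26)
    u = (J.T @ (C @ e)) / r             # eq. (27)
    return u, C
```

With `eps = 0` and a full-row-rank `J`, the returned `C` satisfies eq. (25) up to rounding, and $-e^T J u$ is negative for any nonzero error, in line with eqs. (20)-(22).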

For the FFNN considered in eqs. (2)-(5), the Jacobian can 
be obtained as follows. For the output layer, 

c)/(w,x„) , , 

^ = 2/p, 12(1 ~ yp,i2)'^il (28) 


and for the hidden layer. 


a/(w,xp) 

dwi^ia 


^ ^ 2/p,i2 (1 
i2 


(29) 


B. Levenberg-Marquardt Modification 

The LM modification is applied to the optimal HJB based 
learning scheme by changing Eq. (26) as 

$$C(t) = U S_\mu^{-1/2} U^T \qquad (30)$$

where $S_\mu$ is a diagonal matrix of the eigenvalues of 
$J R^{-1} J^T + \mu I$, and $U$ consists of columns as the corresponding 
eigenvectors. Here, $I$ is the identity matrix and $\mu$ is the LM 
parameter. When a weight update $u$ leads to an increase in 
error $e^T e$, $\mu$ is increased by a fixed factor $\beta$ and a new $u$ is 
computed. This step is repeated until the error decreases and 
the final $u$ is used to update the FFNN weights. On the other 
hand, when a $u$ leads to a decrease in error, it is used to update 
the FFNN weights and $\mu$ is divided by $\beta$. 
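
The accept/reject rule for $\mu$ described above can be sketched as a single iteration. This is an illustration under assumptions: $R = rI$, a unit time step so that the weights move by `w + u`, and a cap `max_tries` on the number of retries; `residual` and `jacobian` are hypothetical user-supplied callables returning $e = y^d - f(w)$ and $J = \partial f/\partial w$.

```python
import numpy as np

def hjb_lm_step(w, residual, jacobian, mu, beta=10.0, r=1.0, max_tries=20):
    """One HJB-LM iteration: retry with a larger mu until e^T e decreases."""
    e = residual(w)
    J = jacobian(w)
    M = (J @ J.T) / r                              # J R^{-1} J^T
    for _ in range(max_tries):
        # Eigen-decomposition of J R^{-1} J^T + mu I, as in eq. (30).
        S, U = np.linalg.eigh(M + mu * np.eye(M.shape[0]))
        C = U @ np.diag(S ** -0.5) @ U.T
        u = (J.T @ (C @ e)) / r
        w_new = w + u                              # assumed unit-step discretization
        e_new = residual(w_new)
        if e_new @ e_new < e @ e:
            return w_new, mu / beta                # accepted: divide mu by beta
        mu *= beta                                 # rejected: increase mu, recompute u
    return w, mu                                   # no improving step found
```

By construction, the squared error never increases from one call to the next; if no improving step is found within `max_tries`, the weights are returned unchanged.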

III. HJB Based Online Learning for FFNN 

In the online mode, the weight vector is updated after the 
network is presented with a pattern. In this case, the network 
dynamics can be presented as 

$$y = f(w) \qquad (31)$$

where the network output is simply observed as a function 
of the network weights only. Here the desired output $y^d$ is 
observed and the network response is compared to find 

$$e = y^d - y. \qquad (32)$$

Thus the network dynamics can be derived following the 
approach given in Sec. II. 


TABLE II 

Average Success Score, in %, for Modulo-2 Function Learning 

FFNN                 Offline               Online             Offline
architecture     BP    LF    HJB       BP    LF    HJB      LM    HJB-LM
3-4-1            24    24     28        0    20      0      84      84
3-6-1            80    88     96       36    68     96      96     100
3-8-1            84    92    100       56    72    100     100     100
3-10-1           90    96    100       80    76    100     100     100
3-15-1          100   100    100      100    80    100     100     100


The problem is to find $u$ so as to minimize 

$$V(e(t)) = \int_t^T \frac{1}{2}\left(e^T e + u^T R u\right) d\tau \qquad (33)$$

One should note that this global cost function does not pertain 
to any specific pattern; rather, the network is subjected to 
various inputs while the instantaneous error $e$ is computed 
as per Eq. (32). 

The optimal instantaneous weight update law, as derived 
using the HJB equation and Lyapunov stability criterion, is 
given by 

$$\dot{w} = u^*(t) = R^{-1} J^T C(t) e(t) \qquad (34)$$

$$\text{where } C(t) = U S^{-1/2} U^T \qquad (35)$$

with $U$ and $S$ as the eigenvectors and eigenvalues of $J R^{-1} J^T$. 


A. Single output Network with Online Learning 

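Let us consider a special case of a single output network 
trained in an online fashion. For such a case $J$ is a $1 \times N_w$ 
matrix, therefore the term $JJ^T$ is a scalar. Assuming $R = rI$, 
the corresponding HJB equation can be written as 

$$e^2 - \frac{\partial V^*}{\partial e}\,\frac{1}{r} JJ^T\,\frac{\partial V^*}{\partial e} = 0 \;\Rightarrow\; \frac{\partial V^*}{\partial e} = \left(\frac{r}{JJ^T}\right)^{1/2} e \qquad (36)$$

where $e = y^d - y$. Therefore, the optimal learning algorithm 
becomes 

$$u^* = \frac{1}{(r JJ^T)^{1/2}}\, J^T e \qquad (37)$$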
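
Eq. (37) needs no eigen-decomposition, since $JJ^T$ is a scalar. A minimal sketch follows; the function name and the guard term `eps` against a vanishing Jacobian are assumptions added for illustration.

```python
import numpy as np

def hjb_online_single_output(J, e, r=1.0, eps=1e-12):
    """Eq. (37): u* = J^T e / sqrt(r J J^T) for a single-output network.

    J   : the 1 x N_w Jacobian, passed as any array with N_w entries
    e   : scalar error y^d - y
    eps : assumed guard against division by zero when J vanishes
    """
    J = np.asarray(J, dtype=float).reshape(-1)
    return J * e / np.sqrt(r * (J @ J) + eps)
```

One consequence of this form is that the step length is $|e|/\sqrt{r}$ regardless of the gradient magnitude: the Jacobian contributes only the direction of the update.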


IV. Experiments and Discussion 

The efficiency of the proposed HJB based optimal learning 
scheme is substantiated with the help of several benchmark 
learning problems. Notably, the purpose of the presented work 
is to optimize a given cost function and not explicitly to 
design robust classification algorithms. Hence, the following 
experiments aim at optimizing the FFNN for a given cost 
function. Nevertheless, in order to show the generalizability 
for classification, several experiments have also been included 
with separate training and testing data over real datasets (Sec. 
IV-D, Sec. IV-E). 

The proposed HJB algorithm has been compared with 
popular learning schemes, viz., BP, LF and LM. BP is the 
standard back-propagation learning scheme with a constant 
learning rate η. The LF scheme is a Lyapunov function based 
learning scheme developed by Behera et al. [9] and has one 
tunable parameter μ. The LM algorithm is an offline algorithm 
with two tunable parameters, μ and β. In the proposed offline as 
well as online HJB algorithms, R has been set to rI, where I 
is an identity matrix and r > 0 is a constant tunable parameter. 


TABLE III 

Average number of epochs (with average run time) for successful convergence for Modulo-2 Function Learning 

Offline
FFNN            BP               LF               HJB
architecture    (η = 1)          (μ = 0.05)       (r = 0.1)
3-4-1           3047 (7.6 s)     1087 (9.5 s)     3519 (6.0 s)
3-6-1           2880 (4.2 s)     2801 (9.9 s)     190 (0.4 s)
3-8-1           1309 (2.4 s)     480 (10.4 s)     134 (0.5 s)
3-10-1          1046 (1.4 s)     447 (0.7 s)      54 (0.09 s)
3-15-1          2544 (3.5 s)     670 (0.9 s)      49 (0.07 s)

Online
FFNN            BP               LF               HJB
architecture    (η = 1)          (μ = 0.2)        (r = 5)
3-4-1           -                5843 (10.9 s)    -
3-6-1           6210 (11.0 s)    2429 (4.6 s)     2939 (5.8 s)
3-8-1           6773 (15.0 s)    3565 (8.8 s)     1722 (4.3 s)
3-10-1          6123 (10.8 s)    3919 (7.3 s)     762 (1.5 s)
3-15-1          3532 (6.4 s)     4922 (9.5 s)     731 (1.5 s)

Offline
FFNN            LM                      HJB-LM
architecture    (μ = 10^-2, β = 10)     (μ = 10^-2, β = 10)
3-4-1           266 (7.0 s)             63 (0.3 s)
3-6-1           37 (0.15 s)             25 (0.05 s)
3-8-1           22 (0.06 s)             26 (0.06 s)
3-10-1          18 (0.05 s)             22 (0.05 s)
3-15-1          17 (0.05 s)             25 (0.05 s)

All the algorithms have been implemented in MATLAB 
R2013a on an Intel(R) Xeon(R) 2.40GHz CPU with 6GB 
RAM. Each experiment consists of multiple trials, each 
starting from a random initial point. For a fair comparison, within a 
trial the same initial point is chosen for all the learning schemes. 

A. Modulo-2 function 

The first example is based on the modulo-2 function, defined 
in Eq. (1). The FFNN is trained using various algorithms in 
offline as well as online modes. The simulation is performed 
for 25 different trials. The results are tabulated in Tables II 
and III along with the parameter values used. Table II shows 
the success score, as defined in eq. (6), for each algorithm. 
Table III presents the average number of epochs as well as the 
average run time for successful convergence of each algorithm. 

It can be observed from Table II that, for all these 
algorithms, achieving a 100% success rate with fewer 
hidden neurons is difficult, but the score improves with the increase 
in the number of hidden neurons. In the case of batch (offline) 
learning, the success score of the proposed HJB algorithm is 
higher than that of both BP and LF algorithms. The proposed 
HJB algorithm achieves more than 95% success score with 
6 or more hidden neurons. The offline LM algorithm is able 
to achieve a high success score even with 4 hidden neurons. 
Nevertheless, the HJB-LM algorithm performs even better. It 
achieves 100% success score with 6 or more hidden neurons. 
On the other hand, learning in the online mode is 
more difficult, as shown by the lower success scores achieved by 
the online BP, LF and HJB schemes. With 4 hidden neurons, 
the BP and HJB schemes were found unable to converge to the 
global minimum. Still, the HJB scheme achieves more than 95% 
success score with 6 or more hidden neurons. This implies a 
high probability of convergence to the global minimum. 

The corresponding average rates of convergence can be 
seen in Table III. In offline mode, the proposed HJB algorithm 
converges much faster than both BP and LF algorithms. The 
number of epochs required for the HJB scheme to converge 
reduces with the increase in the number of hidden neurons, 
and accordingly, the training time goes down. However, the 
convergence time improves drastically by introducing the LM 
modification, as shown by the results of the HJB-LM scheme. On 
the other hand, for online learning, the number of epochs is 
higher than that for offline learning. Still, the proposed HJB 
algorithm converges significantly faster as compared to BP and 
LF algorithms. 

The rate of convergence for one trial each of the BP, LF and 
HJB schemes has been shown in Fig. 3. The error reduces 
the fastest for the proposed HJB algorithm, followed by LF 
and BP, respectively. The improvement in the convergence rate 
with the proposed HJB scheme over that of the other schemes 
is quite significant, differing by orders of magnitude. 

Fig. 3. Comparison of convergence rates of different algorithms for Modulo-2 
function in offline learning. 


B. 8-bit Parity 

The 8-dimensional parity problem is similar to the XOR 
problem: the output is 1 if the number of ones in the 
input is odd. In this problem, 0 and 1 are replaced by 0.1 
and 0.9, respectively. The network architecture has one bias 
(+1) in the input. The simulations are performed for 20 trials 
with different random initialization points. Table IV shows the 
simulation results for different algorithms with a 9-30-1 FFNN. 
Both BP and LF based algorithms fail to converge in all the 
20 trials. However, the HJB based algorithm converges within 
845 epochs with 95% success score. Interestingly, if one uses the 
HJB-LM algorithm, convergence is achieved within 152 
epochs with 100% success score. However, it can be seen 
that the LM scheme performs better than the HJB-LM scheme. 
Nevertheless, the convergence rate of LM degrades with the 
increase in the network size. This can be seen in Table V, 
where the HJB-LM scheme converges faster than even LM 
for a 9-50-1 architecture. 
TABLE IV 

8 Bit Parity: 9-30-1 architecture 

Algorithm      Avg Epochs    Success rate    Parameters
BP Offline     -             0%              η = 0.05
LF Offline     -             0%              μ = 0.10
HJB Offline    845           95%             r = 5
LM             130           100%            μ = 0.001, β = 10
HJB-LM         152           100%            μ = 0.001, β = 10



TABLE V 

8 Bit Parity Problem: 9-50-1 architecture 

Algorithm      Avg Epochs    Success rate    Parameters
HJB Offline    465           100%            r = 5
LM             129           100%            μ = 0.001, β = 10
HJB-LM         97            100%            μ = 0.001, β = 10


TABLE VI 

2D Gabor Function: 3-6-1 architecture 

Algorithm                  Avg RMS error                      Parameters
(Online)      (200 epochs)  (400 epochs)  (2000 epochs)
BP            0.1439        0.1404        0.0311              η = 0.50
BP            0.1222        0.0615        0.0313              η = 0.95
LF            0.0703        0.0354        0.0305              μ = 0.35
HJB           0.0488        0.0326        0.0310              r = 1.5
HJB           0.0479        0.0369        0.0360              r = 1


C. 2D-Gabor Function 


In this simulation, the Gabor function is used for a system 
identification problem. It is represented by the following 
equation. 

$$g(x_1, x_2) = \frac{1}{2\pi(0.5)^2} \exp\left(-\frac{x_1^2 + x_2^2}{2(0.5)^2}\right) \times \cos(2\pi(x_1 + x_2)) \qquad (38)$$

For this problem, 100 test data are randomly generated in the 
range [0, 1]. The network architecture used in this problem 
is 3-6-1, with one bias (+1) input. The simulation is run for 
online learning till the rms error is 0.01 or a maximum of 2000 
epochs. Table VI shows the rms error averaged over 20 runs 
for each method, after 200, 400 and 2000 epochs of learning. 
One can observe that the error decays fastest for 
the proposed HJB algorithm over the first 200 to 400 epochs, 
while all the algorithms reach comparable errors after 2000 epochs. 
Figure 4 compares the error decay for the HJB, LF 
and BP algorithms with epochs. 


D. Breast Cancer Data 

The Breast Cancer problem [23] is a classification problem 
where the neural network is fed with 9 different inputs 
representing different medical attributes, while the output represents 
the class of cancer. The data set has 699 data points, of which 600 
have been used for training and the remaining ones for testing. 
The misclassification rate is calculated as the percentage of test 
points wrongly classified by the algorithm. For this problem, a 
network with 10-15-1 architecture, having one bias (+1) node 
in the input layer, is trained in an online way. The simulations 
are performed for 20 different trials. Table VII shows the 
simulation results. Note that the BP algorithm never converges 
to 0.01 rms error within 6000 iterations. Fig. 5 shows the 
decay of error with successive epochs for the HJB, LF and BP 
algorithms. 


Fig. 4. Performance comparison between HJB, LF and BP algorithm for 
2-D Gabor function 


TABLE VII 

Breast Cancer: 10-15-1 architecture 

Algorithm      Avg. Epochs    Misclassification rate    Parameters
BP Online      6000*          2.95                      η = 0.20
LF Online      3058           2.85                      μ = 0.20
HJB Online     1020           2.05                      r = 2.5

E. Credit Approval Data Set 

Credit approval [23] is also a classification problem, with 14 
inputs and a binary output. The output represents the approval 
or rejection of a credit card. The data set has 690 data points, 
of which 80%, i.e. 552 points, are used for training and the 
rest are used for testing. For this problem, we have selected a 
network with 15-25-1 architecture, with one bias (+1) node in 
the input layer. The training is carried out in online fashion for 20 
different trials, starting from different initializations of network 
weights. Table VIII shows the simulation results. Since none of 
the methods converged to the rms error of 0.01, the training is 
stopped after 2000 epochs. We have also provided the average 
rms error after 1000 epochs for each algorithm. We can see 
that the HJB based algorithm reaches a better rms error 
within 1000 epochs as compared to the error for the BP and LF 
based algorithms. Also, note that the BP algorithm does not 
make much improvement from 1000 to 2000 epochs. 


Fig. 5. Comparison of error decay for HJB, LF and BP algorithm while 
training for Breast Cancer problem 


TABLE VIII 

Credit Approval Data: 15-25-1 architecture 

Algorithm      Avg RMS error     Avg RMS error     Parameters
               (2000 epochs)     (1000 epochs)
BP Online      0.1826            0.1991            η = 0.20
LF Online      0.1379            0.1810            μ = 0.20
HJB Online     0.0776            0.1250            r = 2.5


Fig. 6. Effect of varying the parameter r on the success score and 
convergence time of HJB offline algorithm for modulo-2 function with 3-4-1 
architecture 


F. Parameter Tuning 

The proposed HJB offline and online algorithms have one 
tunable parameter, viz., r. The value of this parameter plays 
a crucial role in determining the convergence properties, as 
shown in Fig. 6. Notably, the BP and LF algorithms also have 
one parameter each, i.e., η and μ, respectively. On the other 
hand, LM and the proposed HJB-LM algorithms have two 
parameters each. 

Derivation of the optimal values of the parameters has 
not been dealt with in this work. For the experiments, the 
parameters have been set manually. 

V. Convergence Behavior 

This section analyzes the convergence behavior of the various 
algorithms used in this paper. The iterative update laws of 
these algorithms are summarized in Table IX. 


TABLE IX 

Weight update laws of various algorithms 

Algorithm                   Update law, $\dot{w} = u$
BP                          $u = \eta J^T e$
LF                          
LM                          $u = [J^T J + \mu I]^{-1} J^T e$
HJB                         $u = \frac{1}{r} J^T C e$
single output online HJB    $u = \frac{1}{(r JJ^T)^{1/2}} J^T e$



Fig. 7. An example error function to be minimized with respect to w 


A. Global Minimization 


Let us analyze the global convergence properties of the 
proposed HJB scheme. The HJB equation for minimizing the 
cost function of eq. (13) provides us with eq. (18), which is 
underdetermined, i.e. not sufficient to derive an optimal $u^*(t)$. 
Hence, we take the help of a Lyapunov function to impose further 
conditions on the solution. However, it turns out that these 
conditions compel the error $e(t)$ to reduce at each step. This 
behaves like a greedy search scheme, ending up in a local 
minimum. On the contrary, convergence to the global optimum 
may sometimes require $e(t)$ to increase as well. 

This can be analyzed in more detail with the help of an 
example. Consider the following single output function with a 
single parameter, as plotted in Fig. 7. 

$$e = -f(w) = \frac{w^4}{4} - \frac{w^3}{3} - w^2 \qquad (39)$$

While minimizing $e$ with respect to $w$, we use the formulation 
described in Sec. III-A. Eq. (36) gives 

$$u^*(t) = \frac{1}{r}\, J\, \frac{\partial V^*}{\partial e} \qquad (40)$$

$$\frac{\partial V^*}{\partial e} = \pm \frac{\sqrt{r}}{|J|}\, e \qquad (41)$$


where $J = \partial f/\partial w = -\partial e/\partial w$. Eq. (41) has two solutions, 
and the HJB formulation does not determine which one to choose. 
For global convergence, $w$ should end up at the value 2. 


Fig. 8. Single network adaptive critic for estimating the optimal weight 
update scheme 


Consider the case when $w \in (-1, 0)$. Here, $\partial V^*/\partial e < 0$ 
because if $e$ increases, the global cost function $V^*$ decreases. 
This makes $u > 0$ as $J < 0$, taking $w$ to the global minimum. On 
the contrary, imposing the stability condition with the Lyapunov 
criterion makes $\partial V^*/\partial e > 0$, assuming that a decrease in error 
decreases the global cost function $V^*$. But this assumption is 
not true here. This makes $u < 0$ as $J < 0$, taking $w$ to -1, 
i.e. a local minimum. Hence, the HJB formulation does not have 
a provision to choose between the positive and negative forms for 
$\partial V^*/\partial e$ in eq. (41). Since the Lyapunov condition always selects 
the positive form, HJB with the help of the Lyapunov condition 
cannot guarantee global convergence. Still, the success score is 
found to increase with HJB as compared to other algorithms. 
This implies that the local minima are often bypassed due to 
the greedy nature of the convergence. 
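
The trapping behaviour described here can be reproduced numerically. The sketch below is an illustration under two assumptions: the example error is taken to be the quartic $e(w) = w^4/4 - w^3/3 - w^2$, which has the local minimum at $w = -1$ and the global minimum at $w = 2$ described in the text, and the error-decreasing update is modelled by plain gradient descent rather than by the HJB law itself. Step size and iteration count are arbitrary choices.

```python
def e(w):
    """Assumed example error: minima at w = -1 (local) and w = 2 (global)."""
    return w**4 / 4 - w**3 / 3 - w**2

def de_dw(w):
    """Derivative of e, equal to w (w + 1) (w - 2)."""
    return w**3 - w**2 - 2 * w

def greedy_descent(w0, step=0.01, iters=5000):
    """Always move so that e decreases (plain gradient descent)."""
    w = w0
    for _ in range(iters):
        w -= step * de_dw(w)
    return w

w_left = greedy_descent(-0.5)    # start inside (-1, 0)
w_right = greedy_descent(0.5)    # start inside (0, 2)
print(w_left, e(w_left))
print(w_right, e(w_right))
```

Starting in $(-1, 0)$ the iterate settles at the local minimum, while a start in $(0, 2)$ reaches the global one; leaving $w = -1$ for $w = 2$ would require $e$ to rise first while crossing the hump at $w = 0$, which an update that reduces $e$ at every step cannot do.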

For determining the exact form of dV*/de, we have 
performed some preliminary simulation experiments using a 
single network adaptive critic (SNAC) [20], [24], [25], as 
shown in Fig. 8. SNAC makes use of the discrete time HJB 
equation, as explained below. 

As in Sec. III, the discrete time state dynamics for the online 
learning of FFNN (refer to eq. O) can be written as 

$$ e_{k+1} = e_k - J_k u_k \qquad (42) $$

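where k is the discrete time index and the time step is assumed 
to be 1 unit. The discrete time global cost function is 

$$ V = \sum_{k=1}^{\infty} \frac{1}{2} \left( e_k^T e_k + r\, u_k^T u_k \right) $$

The optimal cost-to-go function is 

$$ V_k = \min_{u_k} \left\{ \frac{1}{2} \left( e_k^T e_k + r\, u_k^T u_k \right) + V_{k+1} \right\} $$

which, on differentiation with respect to $u_k$ and $e_k$, gives 

$$ u_k = \frac{1}{r}\, J_k^T \lambda_{k+1} \qquad (43) $$

$$ \lambda_k = e_k + \lambda_{k+1} \qquad (44) $$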

where $\lambda_k = \partial V_k / \partial e_k$. A simple 2-layer FFNN, called the critic 
network, is used to estimate $\lambda_{k+1}$ as a function of $x_k$, subject 
to Eq. (44). The implementation details can be found in Behera 
and Kar [20]. 
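
The recursion in Eqs. (42)-(44) can be sketched in a few lines. The block below is a minimal illustration, not the implementation of Behera and Kar [20]: it assumes the error dynamics e_{k+1} = e_k - J_k u_k, a scalar control penalty r, and that an estimate of lambda_{k+1} is already available (in SNAC it would come from the critic network). The function name `snac_step` and its arguments are hypothetical.

```python
import numpy as np


def snac_step(e_k, J_k, lam_next, r=1.0):
    """One discrete-time step under the assumed dynamics e_{k+1} = e_k - J_k u_k.

    e_k      : error vector, shape (n_o,)
    J_k      : Jacobian, shape (n_o, n_w)
    lam_next : estimate of lambda_{k+1}, shape (n_o,)
    r        : control penalty (assumed scalar, r > 0)
    """
    u_k = J_k.T @ lam_next / r   # stationarity with respect to u_k
    e_next = e_k - J_k @ u_k     # assumed state update
    lam_k = e_k + lam_next       # costate relation, usable as a critic target
    return u_k, e_next, lam_k


# Tiny hand-checkable example with one output and two weights.
e_k = np.array([2.0])
J_k = np.array([[1.0, 0.0]])
lam_next = np.array([1.0])
u_k, e_next, lam_k = snac_step(e_k, J_k, lam_next, r=2.0)
```

In a full scheme the returned `lam_k` would serve as the training target for the critic at step k, while `u_k` is applied as the weight update.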

The SNAC based scheme has been implemented for the 
modulo-2 function in online mode with a 3-4-1 (+1 bias in 
input) FFNN, with a 2-6-1 (+1 bias in input) FFNN as critic 
network. The scheme shows a success score of 80% over 
20 trials. This is quite large as compared to other online 
algorithms shown in Table [III where online HJB scheme 
has 0% success score and batch mode HJB has only 28% 
success score. These results are preliminary. 
Nevertheless, they suggest that a properly tuned adaptive critic 
can possibly achieve a 100% success score. This investigation 
forms the future scope of this work. 


Now let us analyze the global convergence behavior of other 
algorithms. Any deterministic weight update law considered 
in this paper should take the weights to a local minimum if 
the time derivative of the corresponding Lyapunov function is 
negative semi-definite. The Lyapunov function is given by 
$V(e) = \frac{1}{2} e^T e$. The equilibrium point e = 0 is 
asymptotically stable if $\dot{V}(e)$ is strictly negative definite. 
For any deterministic weight update scheme for a general FFNN, 
$\dot{e}$ can be expressed as in Eq. (fTSI). 
Thus, 


$$ \dot{V}(e) = -e^T J u(t) \qquad (45) $$

For the HJB scheme, 

$$ \dot{V}(e) = -\frac{1}{r}\, e^T J J^T C e \qquad (46) $$

is negative semi-definite, as $J J^T$ may be only positive semi-definite. 
For the LM scheme also, 

$$ \dot{V}(e) = -e^T J \left[ J^T J + \mu I \right]^{-1} J^T e \qquad (47) $$

is negative semi-definite, as J = 0 at local minima. Moreover, 
it has already been pointed out in Behera et al. [13] that the LF 
algorithm leads to a negative semi-definite derivative of the 
Lyapunov function. This implies that the above algorithms cannot 
guarantee global convergence. 
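
A small numerical check makes the semi-definiteness concrete. The sketch below evaluates the LM expression of Eq. (47) for a rank-deficient Jacobian; the function name `vdot_lm` and the test values are hypothetical, chosen so that one error direction lies in the null space of J^T.

```python
import numpy as np


def vdot_lm(e, J, mu):
    """Evaluate -e^T J (J^T J + mu I)^{-1} J^T e for mu > 0."""
    n_w = J.shape[1]
    # LM-style update direction u = (J^T J + mu I)^{-1} J^T e.
    u = np.linalg.solve(J.T @ J + mu * np.eye(n_w), J.T @ e)
    return float(-(e @ J @ u))


# Rank-deficient Jacobian: the second error component cannot be influenced.
J = np.array([[1.0, 0.0],
              [0.0, 0.0]])
v_active = vdot_lm(np.array([1.0, 0.0]), J, mu=1.0)  # J^T e is nonzero
v_stuck = vdot_lm(np.array([0.0, 1.0]), J, mu=1.0)   # J^T e is zero, e is not
```

Here `v_active` is strictly negative while `v_stuck` is zero for a nonzero error, so the derivative is only negative semi-definite and the update can stall away from e = 0.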


B. Computation time and complexity 

It has been observed that the proposed HJB algorithm is 
much faster than the other algorithms for both online as well 
as offline learning. It will be useful at this stage to compare 
the computational complexity of one iteration of each of these 
algorithms. All the algorithms discussed in this paper are based 
on deterministic iterative update of weights. Moreover, all of 
these involve the calculation of the Jacobian J. They mostly 
differ in their weight update schemes, u. 

Certainly, a single iteration of the HJB and LM schemes involves 
more computation than one of the BP and LF schemes. However, 
the number of iterations required for HJB and LM is significantly 
smaller than that for BP and LF. Hence, the 
former take much less time to converge than the latter. 

The Jacobian J is an $N_o N_p \times N_w$ matrix. HJB involves 
calculation of the eigenvalues of $J J^T$ ($N_o N_p \times N_o N_p$), while 
LM involves calculation of the inverse of $J^T J$ ($N_w \times N_w$). 
Hence, in the offline mode, where $N_p$ is large, one iteration 
of LM is expected to take less time than one of HJB. Nevertheless, 
the number of iterations required for HJB is smaller than that 
for LM, leading to a reduction in the computation time for the 
former. Moreover, as the size of the network, $N_w$, increases, the 
performance of LM deteriorates quickly; but since the size of $J J^T$ is 
independent of $N_w$, HJB is not affected drastically. On the 
other hand, in online mode, $N_p = 1$, and hence the HJB scheme 
becomes even faster. 
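
The size argument can be illustrated with a short sketch. It builds a random Jacobian of the stated shape and reports the sizes of the two matrices each scheme has to process; the function name `gram_sizes`, the random data, and the chosen dimensions are hypothetical and serve only to show how the shapes scale.

```python
import numpy as np


def gram_sizes(n_o, n_p, n_w, seed=0):
    """Return the shapes of J J^T and J^T J, and the eigenvalues of J J^T."""
    rng = np.random.default_rng(seed)
    J = rng.standard_normal((n_o * n_p, n_w))  # stand-in for the Jacobian
    JJt = J @ J.T   # matrix whose eigenvalues the HJB scheme needs
    JtJ = J.T @ J   # matrix the LM scheme regularizes and inverts
    eigvals = np.linalg.eigvalsh(JJt)  # symmetric eigenvalue solver
    return JJt.shape, JtJ.shape, eigvals


batch = gram_sizes(n_o=1, n_p=200, n_w=20)   # offline mode, many patterns
online = gram_sizes(n_o=1, n_p=1, n_w=20)    # online mode, one pattern
```

In the online case J J^T collapses to an N_o x N_o matrix regardless of the number of weights, whereas J^T J stays N_w x N_w in both modes.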


VI. Conclusion 

The HJB equation is well known for dynamic optimization. 
In this paper, the weight update process in FFNN has been 
formulated as a dynamic optimization problem. Applying HJB 
equations, the optimal weight update laws have been derived 
in both batch and online (instantaneous) modes. However, the 
HJB equation turns out to be under-determined, i.e. dV/de 
has multiple solutions with respect to eq. (fTsT), and 
additional conditions have to be imposed to ensure global 
minimization. A proper choice of this solution also requires 
knowledge of the direction in which the global minimum is located. 
This issue requires further investigation. This paper analyzes 
the convergence behavior of various popular learning 
algorithms and explains why they are liable to get stuck in local minima. 
It is shown that the Lyapunov condition drags the solution to 
the local minimum. We have performed an initial simulation 
experiment using a single network adaptive critic [20] to find 
the global convergence for the modulo-2 function. The results 
are very encouraging and will form the basis of our further 
investigation. 

Nevertheless, although the global optimization issue remains 
unresolved, the proposed scheme ensures faster convergence 
rate as compared to the existing schemes, including LM. 
Moreover, the weight update law using HJB equation has been 
derived for both offline as well as online modes, whereas 
LM can be applied only in the offline mode. Our study 
also shows that the proposed HJB scheme works well even 
with large network size, while the LM method deteriorates 
in performance as the size of the network increases, as also 
observed by Xie et al. [26]. 